4/25/23
Medical Physicist (2011-2014);
Master of Science in Biotechnology (2017);
PhD in Biotechnology (2023);
Since 2015 I have worked on bone tissue, especially cell-biomaterial interactions;
Master's: Kinome;
PhD: Transcriptome/Proteome/Epigenome;
Data to wisdom
Data to conspiracy
Data science vs Bioinformatics.
NIH: “Bioinformatics, as related to genetics and genomics, is a scientific subdiscipline that involves using computer technology to collect, store, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences.”
Consortiums
These are online libraries of biological data, collected from scientific experiments, published literature, high-throughput experimental technologies, and computational analyses;
Structured annotations!
Provide access to data programmatically (API);
Data is made freely available under certain licenses;
Microarrays, RNA-Seq, miRNA-Seq, Proteomics (few), ChIP-seq, ATAC-seq, Kinase Array, Single Cell, …
Why MUST you publish your data from high-throughput experiments?
It allows reuse by the scientific community;
LIVES;
Peer-review;
Open-data/Open science;
Good practices;
Advancing RNA-Seq analysis, Nature Biotechnology 28:421-423
GEO2R is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions.
Microarray - LIMMA;
RNA-seq - DESeq2 (beta);
Query: “osteogenesis” / “Mus musculus” / “Expression profiling by array”;
Data matrix (Or Expression matrix):
Count values read by the instrument;
Processed OR not;
Investigation Description Format (IDF);
Sample and Data Relationship Format (SDRF);
Array Design Format (ADF);
Import the data ({readr});
ExpMat: ExpressionMatrix or assayData;
adf: Array description file or featureData;
sdf: Sample description file or phenoData;
Data cleaning;
Golden rule:
The number of ExpMat rows equals the number of adf rows.
The row names and row order of ExpMat match those of adf.
The number of ExpMat columns equals the number of sdf rows.
The column names and column order of ExpMat match the row names and row order of sdf.
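The four checks above can be sketched in base R. This is only an illustration: the toy ExpMat, adf and sdf objects, their probe and sample names, and the helper check_golden_rule() are all invented here and are not part of any package.

```r
# Toy inputs; every name and value below is invented for illustration.
ExpMat <- matrix(
  c(5.1, 6.3, 7.2,
    4.8, 6.0, 7.5),
  nrow = 3,
  dimnames = list(c("probe_1", "probe_2", "probe_3"),
                  c("sample_A", "sample_B"))
)
adf <- data.frame(symbol = c("GeneA", "GeneB", "GeneC"),
                  row.names = c("probe_1", "probe_2", "probe_3"))
sdf <- data.frame(group = c("control", "treated"),
                  row.names = c("sample_A", "sample_B"))

# Hypothetical helper: TRUE only when all four golden-rule checks pass.
check_golden_rule <- function(ExpMat, adf, sdf) {
  checks <- c(
    rows_count = nrow(ExpMat) == nrow(adf),
    rows_names = identical(rownames(ExpMat), rownames(adf)),
    cols_count = ncol(ExpMat) == nrow(sdf),
    cols_names = identical(colnames(ExpMat), rownames(sdf))
  )
  if (!all(checks)) {
    message("Failed checks: ", paste(names(checks)[!checks], collapse = ", "))
  }
  all(checks)
}

ok <- check_golden_rule(ExpMat, adf, sdf)

# A sample table in a different order breaks the rule until it is realigned.
sdf_shuffled <- sdf[c("sample_B", "sample_A"), , drop = FALSE]
broken <- check_golden_rule(ExpMat, adf, sdf_shuffled)
sdf_fixed <- sdf_shuffled[colnames(ExpMat), , drop = FALSE]
fixed <- check_golden_rule(ExpMat, adf, sdf_fixed)
```

Indexing the annotation tables by rownames(ExpMat) or colnames(ExpMat), as in the last step, is one simple way to restore the expected order after cleaning.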
ExpressionSet: R object from the {Biobase} package;
Apply the AnnotatedDataFrame() function to adf and sdf;
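A minimal sketch of how the three tables might be combined into an ExpressionSet, assuming {Biobase} is installed. The toy values and the probe, gene and sample names are invented; real data would come from the imported files.

```r
library(Biobase)

# Toy inputs that already satisfy the golden rule (names are invented).
ExpMat <- matrix(
  c(5.1, 6.3, 7.2,
    4.8, 6.0, 7.5),
  nrow = 3,
  dimnames = list(c("probe_1", "probe_2", "probe_3"),
                  c("sample_A", "sample_B"))
)
adf <- data.frame(symbol = c("GeneA", "GeneB", "GeneC"),
                  row.names = rownames(ExpMat))
sdf <- data.frame(group = c("control", "treated"),
                  row.names = colnames(ExpMat))

# Wrap the annotation tables and build the container.
eset <- ExpressionSet(
  assayData   = ExpMat,
  phenoData   = AnnotatedDataFrame(sdf),
  featureData = AnnotatedDataFrame(adf)
)

exprs(eset)  # expression matrix
pData(eset)  # sample annotation
fData(eset)  # feature annotation
```

Keeping the three pieces in one object means that subsetting samples or features keeps the matrix and its annotations in sync.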
LIMMA
Quality control: plotDensities(), PCA;
Background correction: backgroundCorrect() ;
Normalization: normalizeBetweenArrays();
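The preprocessing steps above could look roughly like this for single-channel data, assuming {limma} is installed. The intensities are simulated, and the choices of "normexp" and "quantile" are illustrative assumptions rather than recommendations for a particular platform.

```r
library(limma)

set.seed(1)  # simulated data only; no real arrays are used here
n_probes   <- 200
n_samples  <- 6
sample_ids <- paste0("sample_", seq_len(n_samples))

# Simulated single-channel foreground (E) and background (Eb) intensities.
E  <- matrix(2^rnorm(n_probes * n_samples, mean = 8, sd = 1.5) + 100,
             nrow = n_probes, dimnames = list(NULL, sample_ids))
Eb <- matrix(rnorm(n_probes * n_samples, mean = 100, sd = 10),
             nrow = n_probes, dimnames = list(NULL, sample_ids))
raw <- new("EListRaw", list(E = E, Eb = Eb))

# Background correction, then between-array normalization
# (for an EListRaw input the result is an EList on the log2 scale).
bg   <- backgroundCorrect(raw, method = "normexp")
norm <- normalizeBetweenArrays(bg, method = "quantile")

# Quality control: per-sample densities and a PCA of the samples.
plotDensities(norm, legend = FALSE)
pca <- prcomp(t(norm$E))
plot(pca$x[, 1:2], main = "PCA of normalized samples")
```

Plotting the densities before and after normalization is a quick way to see whether the samples ended up on a comparable scale.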
LIMMA - differential expression
Create a design: model.matrix();
Fit the data: lmFit();
Make the contrast: makeContrasts();
Fit the contrast: contrasts.fit();
Compute the statistic: eBayes();
Extract the table of differentially expressed genes: topTable();
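The differential expression steps can be chained as in the sketch below, assuming {limma} is installed. The expression matrix is simulated, the two groups ("control" and "treated") and the ten shifted genes are invented for illustration, and a real analysis would start from normalized data.

```r
library(limma)

set.seed(42)  # simulated log2 expression values
n_genes <- 500
group   <- factor(rep(c("control", "treated"), each = 3))
expr <- matrix(rnorm(n_genes * 6, mean = 8, sd = 0.5),
               nrow = n_genes,
               dimnames = list(paste0("gene_", seq_len(n_genes)),
                               paste0("s", 1:6)))
# Shift the first 10 genes upward in the treated samples.
expr[1:10, group == "treated"] <- expr[1:10, group == "treated"] + 3

# Create a design: one column per group, no intercept.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Fit the data, make and fit the contrast, compute the statistics.
fit <- lmFit(expr, design)
contrast_matrix <- makeContrasts(TreatedVsControl = treated - control,
                                 levels = design)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)

# Extract the full table, adjusted with Benjamini-Hochberg.
deg_table <- topTable(fit2, coef = "TreatedVsControl",
                      number = Inf, adjust.method = "BH")
head(deg_table)
```

The no-intercept design makes each coefficient a group mean, so the contrast reads directly as "treated minus control".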
{TCGAbiolinks};
R Package;
Facilitates data access and analysis.
miRNA example - Pancreatic Ductal Adenocarcinoma Study
library(TCGAbiolinks)
CancerProject <- "TCGA-PAAD"
DataDirectory <- paste0("/GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","miRNA_gene_quantification",".rda")
query.miR <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "miRNA Expression Quantification",
#file.type = "hg19.mirna",
legacy = FALSE)
samplesDown.miR <- getResults(query.miR,cols=c("cases"))
dataSmTP.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
typesample = "TP")
dataSmNT.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
typesample = "NT")
queryDown.miR <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "miRNA Expression Quantification",
#file.type = "hg19.mirna",
legacy = FALSE,
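barcode = c(dataSmTP.miR, dataSmNT.miR))
# Assumed steps: download the queried files and load them into dataAssy.miR
GDCdownload(query = queryDown.miR, directory = DataDirectory)
dataAssy.miR <- GDCprepare(query = queryDown.miR,
save = TRUE,
save.filename = FileNameData,
directory = DataDirectory)
rownames(dataAssy.miR) <- dataAssy.miR$miRNA_ID
# using read_count's data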
read_countData <- colnames(dataAssy.miR)[grep("count", colnames(dataAssy.miR))]
dataAssy.miR <- dataAssy.miR[,read_countData]
colnames(dataAssy.miR) <- gsub("read_count_","", colnames(dataAssy.miR))
dataFilt <- TCGAanalyze_Filtering(tabDF = dataAssy.miR,
method = "quantile",
qnt.cut = 0.25)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT.miR],
mat2 = dataFilt[,dataSmTP.miR],
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT")
Enrichment analysis: {clusterProfiler}
Gene Ontology;
KEGG pathways;
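Over-representation analysis of a GO term or KEGG pathway is commonly framed as a hypergeometric test. The base R sketch below shows that calculation for a single gene set. It does not call {clusterProfiler}, and the gene universe, the gene set and the list of differentially expressed genes are all invented for illustration.

```r
# Invented universe of 1000 genes, one annotated gene set, and a DE gene list.
universe <- paste0("gene_", 1:1000)
gene_set <- universe[1:50]                          # hypothetical term or pathway
deg      <- c(universe[1:20], universe[501:530])    # 50 DE genes, 20 in the set

N <- length(universe)                 # background size
K <- length(gene_set)                 # genes annotated to the set
n <- length(deg)                      # DE genes
k <- length(intersect(deg, gene_set)) # DE genes that are in the set

# P(X >= k): probability of an overlap at least this large by chance.
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Overlap expected by chance, for comparison with the observed k.
expected_overlap <- n * K / N
```

With many gene sets tested at once, the resulting p-values would still need a multiple-testing correction, for example with p.adjust().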
Gene set enrichment: {fgsea}, {GSVA}
Molecular Signatures Database (MSigDB);
Network interactions
Marcel Ferreira, PhD (@marceelrf)
https://quartodomarcel.netlify.app/
https://marcel-ferreira.shinyapps.io/SciDashboard_marceelrf/
https://learn.gencore.bio.nyu.edu/rna-seq-analysis/
https://star-protocols.cell.com/protocols/931
https://statquest.org/
https://www.youtube.com/watch?v=tlf6wYJrwKY
https://home.proffernandamaciel.com.br/
https://sydney-informatics-hub.github.io/training-RNAseq-slides/01_IntroductionToRNASeq/01_IntroductionToRNASeq.html#1
https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf